Chris Pollett >
Old Classses > |
HW#2 --- last modified February 06 2019 04:15:58..Due date: Mar 16
Files to be submitted: Purpose: Related Course Outcomes: The main course outcomes covered by this assignment are: CLO1 -- Code a basic inverted index capable of performing conjunctive queries. CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics. CLO6 -- Be able to evaluate search results by hand and using TREC eval software. Specification: The homework consists of three parts: Some written problems, a coding exercise, and an evaluations task. Materials for all three parts should be zipped together and submitted as Hw2.zip. If you are working in a team, have a readme.txt listing your team mates and their student ids. Below are the written problems I'd like you to do:
For the coding part of your homework I want you to expand upon the code you wrote for HW1. Now your program will be run from the command-line with a command like: php search_program.php some_dir query ranking_method tokenization_method some_dir is, as in the last assignment, a folder containing plain text file documents. We will assume these documents are written in American English. query is a one or more term search query (if more than one term put the whole query in quotes). ranking_method should be one of cosine or proximity -- this will determine how you are going to rank the results you return. Finally, tokenization_method should be one of none, stem, or chargram. On a reasonable set of command line inputs, your program should use glob to get the *.txt file names in the folder some_dir . We will treat the position in the array returned by glob as the document id of a document. Before your program does tokenization, I want you to first convert the text to lower case and then strip punctuation. Then depending on the tokenization_method , I want you to either make tokens using just white space as the delimiter, or to use white space and then apply either an English stemmer or do chargramming. For the stemming or chargramming, I want you to make use of Yioop as a library using Composer. You can have your project depend on version 3.4.0 of Yioop. For chargramming, you can use the function getNGramsTerm where $n is defined as 5. For this homework, your program does not output the inverted index. Instead, I want you to store these informations into a class which has the four public methods first, last, prev, and next, described in class for the inverted index ADT. Your implementation of prev and next should use galloping search. The command-line argument query above can be either a single word for example "bob", it can be a string of words for example "bob smith". If it is more that one term, when I run your program I will put the sequence of terms in double-quotes, so that it will come in to your program as a single command-line argument. As an example, I might run your program with a line like: php search_program.php ./ "apple records" cosine chargram On an input as above, your program should return a sequence of lines, each line containing a pair (doc_id, rank). Here doc_id should be a document which contains the word apple and contains the word records (not necessarily adjacent to each other). Here rank should be the cosine or proximity rank score for that document with respect to this query. You should output results in descending order of rank. In computing cosine rank, use TF_IDF values for the vector component where the TF function is the example one given in the book. For the last part of your assignment, I want you to create a small text collection of 20 documents and put them in a subfolder my_test_files as part of what you submit. Name the files: file1.txt, file2.txt ... Make up a list of six queries (two single word, two string of words, two three words). For each of your queries and each of the documents in my_test_files make a 1 (yes), 0 (no) judgment about the relevance of that document to that query. Each of your queries should have the key words searched on as well as a description of the words you expect to get back. Number your queries and put your queries and their descriptions in queries.txt. For instance, you might have a query as your query number 1, the query "apple records", with a description of the query being the record company of the Beatles. As part of your description, you might say a document about Apple computers which mentions the word records should not be considered a relevant result for this query. In a separate text file, judgments.txt, you should lists the judgments you came up with above. This file should consist of lines like (query_no, doc_no, relevance) which record this information. View a score of 0 for a document as that document not being returned. Now fix the rank mode as either (your choice) proximity or cosine ranking, I want you to compute by hand or by script the recall, precision, and MAP scores of your search program with respect to each query using your judgments in the case where char-gramming, stemming, or nothing is used. Put your calculations into the file scores.txt are have at least one paragraph where you draw conclusions. Point Breakdown
|